Extracting Patents that cite Publications from a chosen Research Organization¶
This tutorial shows how to extract and analyse patent information linked to a selected research organization, using the Dimensions Analytics API.
Load libraries and log in¶
[23]:
# @markdown # Get the API library and login
# @markdown **Privacy tip**: leave the password blank and you'll be asked for it later. This can be handy on shared computers.
username = "" #@param {type: "string"}
password = "" #@param {type: "string"}
endpoint = "https://app.dimensions.ai" #@param {type: "string"}
# import all libraries and login
!pip install dimcli plotly_express -U --quiet
import dimcli
from dimcli.shortcuts import *
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()
#
import os
import sys
import time
import json
import pandas as pd
from pandas.io.json import json_normalize
from tqdm.notebook import tqdm as progressbar
#
# charts lib
import plotly_express as px
if not 'google.colab' in sys.modules:
    # make js dependencies local / needed by html exports
    from plotly.offline import init_notebook_mode
    init_notebook_mode(connected=True)
DimCli v0.6.2.2 - Succesfully connected to <https://app.dimensions.ai> (method: dsl.ini file)
A couple of utility functions to simplify exporting CSV files to a selected folder
[3]:
#
# data-saving utils
#
DATAFOLDER = "extraction1"
#
if not os.path.exists(DATAFOLDER):
    !mkdir $DATAFOLDER
    print(f"==\nCreated data folder:", DATAFOLDER + "/")
#
def save_as_csv(df, save_name_without_extension):
    "usage: `save_as_csv(dataframe, 'filename')`"
    df.to_csv(f"{DATAFOLDER}/{save_name_without_extension}.csv", index=False)
    print("===\nSaved: ", f"{DATAFOLDER}/{save_name_without_extension}.csv")
Choose a GRID Research Organization¶
For the purpose of this exercise, we are going to use grid.89170.37. Feel free to change the parameters below as you wish, e.g. by choosing another GRID organization.
[4]:
GRIDID = "grid.89170.37" #@param {type:"string"}
#@markdown The start/end year of publications used to extract patents
YEAR_START = 2000 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2016 #@param {type: "slider", min: 1950, max: 2020}
if YEAR_END < YEAR_START:
    YEAR_END = YEAR_START
#
# gen link to Dimensions
#
def dimensions_url(grids):
    root = "https://app.dimensions.ai/discover/publication?or_facet_research_org="
    return root + "&or_facet_research_org=".join([x for x in grids])
from IPython.core.display import display, HTML
display(HTML('---<br /><a href="{}">Open in Dimensions ⧉</a>'.format(dimensions_url([GRIDID]))))
#@markdown ---
1 - Prerequisite: Extracting Publications Data¶
By looking at the Dimensions API data model, we can see that the connection between Patents and Publications is represented via a directed arrow going from Patents to Publications: this means we should look for patent records where the publication_ids field contains references to the GRID publications we are interested in.
Hence, we need to:
a) extract all publications linked to one (or more) GRID IDs, and
b) use these publications to extract patents referencing them.
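To make the linkage direction concrete, here is a schematic sketch of how a patent's publication_ids list can be checked against an organization's publication IDs. The field names match the queries used below; the record values are made up for illustration and are not real data:

```python
# schematic shape of a patent record as returned by the API
# (illustrative values only, not real data)
patent = {
    "id": "US-0000000-A1",
    "title": "Example patent",
    "publication_ids": ["pub.1000000001", "pub.1000000002"],
}

# publication IDs belonging to our chosen GRID organization (made-up IDs)
org_pub_ids = {"pub.1000000002", "pub.1000000003"}

# the patent is relevant if it references at least one of our publications
cites_org = any(p in org_pub_ids for p in patent["publication_ids"])
print(cites_org)  # → True
```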
[5]:
# Get full list of publications linked to this organization for the selected time frame
q = f"""search publications
where research_orgs.id="{GRIDID}"
and year in [{YEAR_START}:{YEAR_END}]
return publications[basics+category_for+times_cited]"""
pubs_json = dsl.query_iterative(q)
pubs = pubs_json.as_dataframe()
# save the data
save_as_csv(pubs, f"pubs_{GRIDID}")
1000 / 17204
2000 / 17204
3000 / 17204
4000 / 17204
5000 / 17204
6000 / 17204
7000 / 17204
8000 / 17204
9000 / 17204
10000 / 17204
11000 / 17204
12000 / 17204
13000 / 17204
14000 / 17204
15000 / 17204
16000 / 17204
17000 / 17204
17204 / 17204
===
Saved: extraction1/pubs_grid.89170.37.csv
How many publications per year?¶
Let’s have a quick look at the publication volume per year.
[6]:
px.histogram(pubs, x="year", y="id", color="type", barmode="group", title=f"Publications by year from {GRIDID}")
What are the main subject areas?¶
We can use the Field of Research categories information in publications to obtain a breakdown of the publications by subject areas.
This can be achieved by ‘exploding’ the category_for data into a separate table, since there can be more than one category per publication. The new categories table also retains some basic information about the publications it relates to (e.g. journal, title, publication ID), so as to make the data easier to analyse.
[7]:
# ensure key exists in all rows (even if empty)
normalize_key("category_for", pubs_json.publications)
normalize_key("journal", pubs_json.publications)
# explode subjects into separate table
pubs_subjects = json_normalize(pubs_json.publications, record_path=['category_for'],
meta=["id", "type", ["journal", "title"], "year"],
errors='ignore', record_prefix='for_')
# add a new column: category name without digits for better readability
pubs_subjects['topic'] = pubs_subjects['for_name'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))
Now we can build a scatter plot that shows the amount and distribution of categories over the years.
[22]:
px.scatter(pubs_subjects, x="year", y="topic", color="type",
hover_name="for_name",
height=800,
marginal_x="histogram", marginal_y="histogram",
title=f"Top publication subjects for {GRIDID} (marginal subplots = X/Y totals)")
2 - Extracting Patents linked to Publications¶
In this section we extract all patents linked to the publications dataset previously created. The steps are the following:
we loop over the publication IDs and create patent queries, via the referencing publication_ids field of patents
we collate all patents data, remove duplicates and save the results
finally, we count patents per publication and enrich the original publication dataset with these numbers
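The dimcli chunks_of helper used below simply slices a list into fixed-size batches (we batch the IDs because a very long `in` filter can exceed query size limits). A minimal equivalent, shown here only for illustration:

```python
def chunks_of(data, size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(data), size):
        yield data[i:i + size]

batches = list(chunks_of(["pub.1", "pub.2", "pub.3", "pub.4", "pub.5"], 2))
print(batches)  # → [['pub.1', 'pub.2'], ['pub.3', 'pub.4'], ['pub.5']]
```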
[24]:
#
# the main query
#
q = """search patents where publication_ids in {}
return patents[basics+publication_ids+FOR]"""
from dimcli.shortcuts import chunks_of
#
# let's loop through all publication IDs in chunks and query Dimensions
#
print("===\nExtracting patents data ...")
patents_json = []
BATCHSIZE = 400
VERBOSE = False # set to True to see patents extraction logs
pubsids = pubs['id']
for chunk in progressbar(list(chunks_of(list(pubsids), BATCHSIZE))):
    data = dsl.query_iterative(q.format(json.dumps(chunk)), verbose=VERBOSE)
    patents_json += data.patents
    time.sleep(1)
#
# put the patents data into a dataframe, remove duplicates and save
#
patents = pd.DataFrame().from_dict(patents_json)
print("Patents found: ", len(patents))
patents.drop_duplicates(subset='id', inplace=True)
print("Unique Patents found: ", len(patents))
# save
save_as_csv(patents, f"patents_{GRIDID}")
# turning lists into strings to ensure compatibility with CSV loaded data
# see also: https://stackoverflow.com/questions/23111990/pandas-dataframe-stored-list-as-string-how-to-convert-back-to-list
patents['publication_ids'] = patents['publication_ids'].apply(lambda x: ','.join(map(str, x)))
#
# count patents per publication and enrich the original dataset
#
def patents_per_pub(pubid):
    "Return the patents whose publication_ids string contains the given publication ID."
    global patents
    # regex=False: publication IDs contain dots, which would otherwise act as regex wildcards
    return patents[patents['publication_ids'].str.contains(pubid, regex=False)]
print("===\nCounting patents per publication...")
l = []
for x in progressbar(pubsids):
    l.append(len(patents_per_pub(x)))
#
# enrich and save the publications data
#
pubs['patents'] = l
save_as_csv(pubs, f"pubs_{GRIDID}_enriched_patents")
===
Extracting patents data ...
Patents found: 4661
Unique Patents found: 3968
===
Saved: extraction1/patents_grid.89170.37.csv
===
Counting patents per publication...
===
Saved: extraction1/pubs_grid.89170.37_enriched_patents.csv
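A note on the stringified publication_ids column: when reloading the saved patents CSV later, the comma-joined strings can be split back into Python lists. A minimal sketch, using an in-memory CSV with made-up values:

```python
import pandas as pd
from io import StringIO

# simulate reloading a saved patents CSV where publication_ids was comma-joined
csv_text = 'id,publication_ids\nUS-0000000-A1,"pub.1,pub.2,pub.3"\n'
df = pd.read_csv(StringIO(csv_text))

# split the joined string back into a list of publication IDs
df["publication_ids"] = df["publication_ids"].str.split(",")
print(df.loc[0, "publication_ids"])  # → ['pub.1', 'pub.2', 'pub.3']
```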
A quick look at the data¶
[25]:
# display top 3 rows
patents.head(3)
[25]:
| FOR | assignee_names | assignees | filing_status | granted_year | id | inventor_names | publication_date | publication_ids | times_cited | title | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | [VIB VZW, UNIV GENT, UNIV BRUSSEL VRIJE] | [{'id': 'grid.5342.0', 'city_name': 'Ghent', '... | Application | NaN | WO-2017153345-A1 | [TAVERNIER, JAN, VAN DER HEYDEN, José, DEVOOGD... | 2017-09-14 | pub.1019496911,pub.1006473292,pub.1067588797,p... | NaN | CD20 BINDING AGENTS AND USES THEREOF | 2017 |
| 1 | [{'id': '2921', 'name': '0912 Materials Engine... | [Taiwan Semiconductor Manufacturing Co (TSMC) ... | [{'id': 'grid.454156.7', 'city_name': 'Hsinchu... | Application | NaN | US-20160240719-A1 | [Meng-Yu Lin, Shih-Yen Lin, Si-Chen Lee, Samue... | 2016-08-18 | pub.1021290825,pub.1034244510,pub.1022541167 | 11.0 | Semiconductor Devices Comprising 2D-Materials ... | 2015 |
| 2 | [{'id': '2921', 'name': '0912 Materials Engine... | [Taiwan Semiconductor Manufacturing Co (TSMC) ... | [{'id': 'grid.454156.7', 'city_name': 'Hsinchu... | Grant | 2018.0 | US-9859115-B2 | [Meng-Yu Lin, Shih-Yen Lin, Si-Chen Lee, Samue... | 2018-01-02 | pub.1073831920,pub.1034244510,pub.1021290825,p... | NaN | Semiconductor devices comprising 2D-materials ... | 2015 |
NOTE The publications dataset now includes patents info too:
[26]:
pubs.sort_values("patents", ascending=False).head(3)
[26]:
| author_affiliations | category_for | id | issue | journal.id | journal.title | pages | times_cited | title | type | volume | year | patents | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13620 | [[{'first_name': 'Leonidas C', 'last_name': 'P... | [{'id': '2921', 'name': '0912 Materials Engine... | pub.1008622444 | 2-3 | jour.1047200 | Organic Electronics | 113-121 | 55.0 | High efficiency molecular organic light-emitti... | article | 4 | 2003 | 237 |
| 16384 | [[{'first_name': 'M.', 'last_name': 'Friedman'... | [{'id': '2921', 'name': '0912 Materials Engine... | pub.1061127158 | 2 | jour.1123356 | IEEE Transactions on Microwave Theory and Tech... | 341-348 | 5.0 | Low-Loss RF Transport Over Long Distances | article | 49 | 2001 | 150 |
| 11680 | [[{'first_name': 'Igor L.', 'last_name': 'Medi... | [{'id': '2202', 'name': '02 Physical Sciences'... | pub.1012397724 | 6 | jour.1031408 | Nature Materials | 435-446 | 4199.0 | Quantum dot bioconjugates for imaging, labelli... | article | 4 | 2005 | 90 |
3 - Patents Data Analysis¶
Now that we have extracted all the data we need, let’s start exploring them by building a few visualizations.
How many patents per year?¶
[27]:
px.histogram(patents, x="year", y="id", color="filing_status",
barmode="group",
title=f"Patents referencing publications from {GRIDID} - by year")
Who is filing the patents?¶
This can be done by looking at the assignees field of patents. Since the field contains nested information, we first need to extract it into its own table (similarly to what we’ve done above with publication categories).
[33]:
# ensure the key exists in all rows (even if empty)
normalize_key('assignees', patents_json)
# explode assignees into separate table
patents_assignees = json_normalize(patents_json,
record_path=['assignees'],
meta=['id', 'year', 'title'],
meta_prefix="patent_")
top_assignees = patents_assignees.groupby(['name', 'country_name'], as_index=False).count().sort_values(by="patent_id", ascending=False)
# preview the data: note that the patent_id column holds the COUNT of patents
top_assignees[['name', 'country_name', 'patent_id']].head()
[33]:
| name | country_name | patent_id | |
|---|---|---|---|
| 549 | United States Department of the Navy | United States | 550 |
| 552 | Universal Display Corporation | United States | 232 |
| 5 | AT&T (United States) | United States | 155 |
| 287 | Massachusetts Institute of Technology | United States | 121 |
| 565 | University of California System | United States | 111 |
[30]:
px.bar(top_assignees, x="name", y="patent_id",
hover_name="name", color="country_name",
height=900,
title=f"Top Assignees for patents referencing publications from {GRIDID}")
[44]:
px.scatter(patents_assignees, x="name", y="country_name",
color="patent_year", hover_name="name",
height = 1000,
hover_data=["id", "patent_id"], marginal_y="histogram",
title=f"Assignees for patents referencing publications from {GRIDID} - Yearly breakdown")
What are the publications most frequently referenced in patents?¶
[36]:
pubs_cited = pubs.query("patents > 0 ").sort_values('patents', ascending=False).copy()
pubs_cited.head()
[36]:
| author_affiliations | category_for | id | issue | journal.id | journal.title | pages | times_cited | title | type | volume | year | patents | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13620 | [[{'first_name': 'Leonidas C', 'last_name': 'P... | [{'id': '2921', 'name': '0912 Materials Engine... | pub.1008622444 | 2-3 | jour.1047200 | Organic Electronics | 113-121 | 55.0 | High efficiency molecular organic light-emitti... | article | 4 | 2003 | 237 |
| 16384 | [[{'first_name': 'M.', 'last_name': 'Friedman'... | [{'id': '2921', 'name': '0912 Materials Engine... | pub.1061127158 | 2 | jour.1123356 | IEEE Transactions on Microwave Theory and Tech... | 341-348 | 5.0 | Low-Loss RF Transport Over Long Distances | article | 49 | 2001 | 150 |
| 11680 | [[{'first_name': 'Igor L.', 'last_name': 'Medi... | [{'id': '2202', 'name': '02 Physical Sciences'... | pub.1012397724 | 6 | jour.1031408 | Nature Materials | 435-446 | 4199.0 | Quantum dot bioconjugates for imaging, labelli... | article | 4 | 2005 | 90 |
| 14084 | [[{'first_name': 'Jyoti K.', 'last_name': 'Jai... | [{'id': '2581', 'name': '0601 Biochemistry and... | pub.1036950132 | 1 | jour.1115214 | Nature Biotechnology | 47-51 | 1466.0 | Long-term multiple color imaging of live cells... | article | 21 | 2003 | 69 |
| 11398 | [[{'first_name': 'Haizhou', 'last_name': 'Yin'... | [{'id': '2921', 'name': '0912 Materials Engine... | pub.1061591798 | 10 | jour.1019990 | IEEE Transactions on Electron Devices | 2207-2214 | 14.0 | Ultrathin Strained-SOI by Stress Balance on Co... | article | 52 | 2005 | 67 |
[37]:
px.bar(pubs_cited[:1000], color="type",
x="year", y="patents",
hover_name="title", hover_data=["journal.title"],
title=f"Top Publications from {GRIDID} mentioned in patents, by year of publication")
What are the main subject areas of referenced publications?¶
[38]:
THRESHOLD_PUBS = 1000
citedids = list(pubs_cited[:THRESHOLD_PUBS]['id'])
pubs_subjects_cited = pubs_subjects[pubs_subjects['id'].isin(citedids)]
[39]:
px.scatter(pubs_subjects_cited, x="year", y="topic", color="type",
hover_name="for_name",
height=800,
marginal_x="histogram", marginal_y="histogram",
title=f"Top {THRESHOLD_PUBS} {GRIDID} publications cited by patents - by subject area")
Is there a correlation between publication citations and patent citations?¶
Note: if the points on a scatterplot graph produce a lower-left-to-upper-right pattern (see below), that is indicative of a positive correlation between the two variables. This pattern means that when the score of one observation is high, we expect the score of the other observation to be high as well, and vice versa.
[40]:
px.scatter(pubs, x="patents", y="times_cited",
title=f"Citations of {GRIDID} publications from publications VS from patents")
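The visual check above can be complemented with a numeric one: pandas can compute the Pearson correlation between the two columns directly. A minimal sketch using a toy DataFrame with made-up values (the real pubs frame has the same two columns):

```python
import pandas as pd

# toy stand-in for the pubs dataframe (made-up values)
toy = pd.DataFrame({
    "times_cited": [5, 12, 40, 150, 900],  # citations from other publications
    "patents":     [0, 1, 3, 10, 25],      # citations from patents
})

# Pearson correlation coefficient: +1 = perfect positive, 0 = none, -1 = negative
r = toy["times_cited"].corr(toy["patents"])
print(round(r, 2))
```

On the real data this would read `pubs["times_cited"].corr(pubs["patents"])`.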
Conclusions¶
In this Dimensions Analytics API tutorial we have seen how, starting from a GRID organization, it is possible to extract
publications from authors associated with this organization
patents citing those publications (from any organization)
We have also done a basic analysis of the citing patents dataset, using fields like citation year, assignees, etc.
This only scratches the surface of the possible applications of publication-patents linkage data, but hopefully it’ll give you a few basic tools to get started building your own application!
Note
The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated Github repository for examples, the source code of these tutorials and much more.